One Tokenization per Source

Author

  • Jin Guo
Abstract

We report in this paper the observation of one tokenization per source. That is, the same critical fragment in different sentences from the same source almost always realizes one and the same of its many possible tokenizations. This observation proves very helpful in sentence tokenization practice, and is argued to have far-reaching implications for natural language processing.

1 Introduction

This paper sets out to establish the hypothesis of one tokenization per source. That is, if an ambiguous fragment appears two or more times in different sentences from the same source, it is extremely likely that all occurrences will share the same tokenization.

Sentence tokenization is the task of mapping sentences from character strings into streams of tokens. This is a long-standing problem in Chinese language processing, since Chinese apparently lacks explicit word delimiters such as the white space used in English. Researchers have gradually turned to modeling the task as a general lexicalization or bracketing problem in computational linguistics, in the hope that the research might also benefit the study of similar problems in other languages. For instance, in machine translation, it is widely agreed that many multiple-word expressions, such as idioms, compounds and some collocations, while not explicitly delimited in sentences, should ideally be treated as single lexicalized units.

The primary obstacle in sentence tokenization is the existence of uncertainties both in the notion of words/tokens and in the recognition of words/tokens in context. The same fragment may have to be tokenized differently in different contexts. For instance, the character string todayissunday would normally be tokenized as "today is sunday" but can also reasonably be ...
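The ambiguity described above, and one way the hypothesis might be exploited, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicon, the function `tokenizations`, and the class `SourceTokenizer` are hypothetical names invented here, not the paper's method. The sketch enumerates every dictionary-consistent split of a fragment, then remembers the split chosen for a fragment so that later occurrences in the same source reuse it.

```python
from functools import lru_cache

# Hypothetical toy lexicon, chosen only for illustration.
LEXICON = frozenset({"to", "day", "today", "is", "sun", "sunday"})


def tokenizations(text, lexicon=LEXICON):
    """Return every way to split `text` into words found in `lexicon`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the string yields one (empty) completion.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            word = text[start:end]
            if word in lexicon:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


class SourceTokenizer:
    """Assumed usage of the hypothesis: within one source, reuse the
    tokenization first chosen for a fragment instead of re-deciding it."""

    def __init__(self, choose):
        self.choose = choose  # hypothetical disambiguation function
        self.memory = {}      # fragment -> tokenization chosen for this source

    def tokenize(self, fragment):
        if fragment not in self.memory:
            self.memory[fragment] = self.choose(tokenizations(fragment))
        return self.memory[fragment]


if __name__ == "__main__":
    for candidate in tokenizations("todayissunday"):
        print(" ".join(candidate))

    # Assumption: prefer the split with the fewest tokens.
    source = SourceTokenizer(choose=lambda cands: min(cands, key=len))
    print(source.tokenize("todayissunday"))
```

With this lexicon the fragment has four candidate splits, and the per-source memory means the disambiguation step runs only once per distinct fragment. How the first choice is actually made is left open here; the fewest-tokens rule is a placeholder.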


Related resources

Optimizing Tokenization Choice for Machine Translation across Multiple Target Languages

Tokenization is very helpful for Statistical Machine Translation (SMT), especially when translating from morphologically rich languages. Typically, a single tokenization scheme is applied to the entire source-language text, regardless of the target language. In this paper, we evaluate the hypothesis that SMT performance may benefit from different tokenization schemes for different words within...


A CRF-Based System for Recognizing Chemical Entities in Biomedical Literature

One of the tasks of the BioCreative IV competition, the CHEMDNER task, includes two subtasks: CEM and CDI. We participated in the latter subtask, and developed a CEM recognition system on the basis of a CRF approach and some open-source NLP toolkits. Our system's processing pipeline consists of three major components: pre-processing (sentence detection, tokenization), recognition (CRF-based approach), a...


UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing

Automatic natural language processing of large texts often presents recurring challenges in multiple languages: even for most advanced tasks, the texts are first processed by basic processing steps – from tokenization to parsing. We present an extremely simple-to-use tool consisting of one binary and one model (per language), which performs these tasks for multiple languages without the need fo...


Translation Corpus Source and Size in Bilingual Retrieval

This paper explores corpus-based bilingual retrieval where the translation corpora used vary by source and size. We find that the quality of translation alignments and the domain of the bitext are important. In some settings these factors are more critical than corpus size. We also show that judicious choice of tokenization can reduce the amount of bitext required to obtain good bilingual retri...


Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank

Treebank Annotation Issue: Multiple Levels of Annotation
  • Annotation not on the source text, but more abstract representation
  • How to maintain annotation consistency and relation between different levels?
  • How to make available the multiple levels of representation for the user?
Arabic Treebank as a case study:
  • Mapping between two levels of annotation:
  • Morphological analysis of source te...




Publication date: 1998